This dataset comes from a 2014 survey of tech workers about mental health, acquired from OSMI (Open Sourcing Mental Illness) at https://osmihelp.org/research; it has 1,259 respondents. I'm passionate about mental health and fascinated by applying DS & ML to this area, so I decided to analyze this dataset and build machine learning models to predict mental health outcomes from demographic variables and workplace factors.
# import libraries
import pandas as pd
from random import random
import numpy as np
from datetime import datetime
pd.options.mode.chained_assignment = None # suppress chained-assignment warnings
## machine learning
#scipy
from scipy import stats
from scipy.stats import randint
# preprocessing
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler
# models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
# evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns # seaborn for simplified/prettified plots
sns.set_theme(rc={'figure.figsize': (15, 8)}) # apply the default theme with a larger figure size
import plotly.express as px # plotly for interactive visualizations
# read in the data
mh_df = pd.read_csv('../data/mental_health_survey.csv')
mh_df.head()
| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | NaN | No | Yes | Often | 6-25 | ... | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | NaN |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | NaN | No | No | Rarely | More than 1000 | ... | Don't know | Maybe | No | No | No | No | No | Don't know | No | NaN |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | NaN | NaN | No | No | Rarely | 6-25 | ... | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | NaN |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | NaN | NaN | Yes | Yes | Often | 26-100 | ... | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | NaN |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | NaN | No | No | Never | 100-500 | ... | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | NaN |
5 rows × 27 columns
mh_df.info() # analyze the columns and the amount of non-null data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness_program           1259 non-null   object
 15  seek_help                  1259 non-null   object
 16  anonymity                  1259 non-null   object
 17  leave                      1259 non-null   object
 18  mental_health_consequence  1259 non-null   object
 19  phys_health_consequence    1259 non-null   object
 20  coworkers                  1259 non-null   object
 21  supervisor                 1259 non-null   object
 22  mental_health_interview    1259 non-null   object
 23  phys_health_interview      1259 non-null   object
 24  mental_vs_physical         1259 non-null   object
 25  obs_consequence            1259 non-null   object
 26  comments                   164 non-null    object
dtypes: int64(1), object(26)
memory usage: 265.7+ KB
# drop unnecessary columns that won't be used in analysis
mh_df = mh_df.drop(['comments'], axis= 1)
# replace empty or missing values with default values for each data type
default_int = 0
default_str = 'NaN'
default_float = 0.0
# create lists of columns by data type
dtype_list = mh_df.columns.to_series().groupby(mh_df.dtypes).groups
dtype_dict = {k.name: v for k, v in dtype_list.items()}
int_features = list(dtype_dict['int64'])
str_features = list(dtype_dict['object'])
# replace missing values with defaults
for feature in mh_df:
    if feature in int_features:
        mh_df[feature] = mh_df[feature].fillna(default_int)
    elif feature in str_features:
        mh_df[feature] = mh_df[feature].fillna(default_str)
    else:
        print(f'Error: Feature {feature} not recognized')
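An equivalent and more concise way to split columns by type is `select_dtypes`, which selects columns by kind directly instead of grouping the dtypes Series. A minimal sketch on a toy frame (column names and values invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 40],            # numeric column with a gap
    "Gender": ["male", None, "female"]  # string column with a gap
})

# select columns by dtype instead of grouping mh_df.dtypes
num_cols = df.select_dtypes(include="number").columns
str_cols = df.select_dtypes(include="object").columns

# fill each group with its default value
df[num_cols] = df[num_cols].fillna(0)
df[str_cols] = df[str_cols].fillna("NaN")

print(df)
```

This avoids the dtype-to-column-list dictionary entirely, at the cost of handling any unexpected dtypes implicitly rather than with an explicit error branch.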
# clean and fill missing values in the Age column
# complete missing Age data with the median age
mh_df['Age'] = mh_df['Age'].fillna(mh_df['Age'].median())
# clip outlier ages (<18 or >100) to the minimum and maximum plausible values
mh_df.loc[mh_df['Age'] < 18, 'Age'] = 18
mh_df.loc[mh_df['Age'] > 100, 'Age'] = 100
# create column for age groups
mh_df['age_range'] = pd.cut(mh_df['Age'], [0, 20, 30, 65, 100],
                            labels=["0-20", "21-30", "31-65", "66-100"], include_lowest=True)
# validate the data cleaning
mh_df[['Age', 'age_range']]
| | Age | age_range |
|---|---|---|
| 0 | 37 | 31-65 |
| 1 | 44 | 31-65 |
| 2 | 32 | 31-65 |
| 3 | 31 | 31-65 |
| 4 | 31 | 31-65 |
| ... | ... | ... |
| 1254 | 26 | 21-30 |
| 1255 | 32 | 31-65 |
| 1256 | 34 | 31-65 |
| 1257 | 46 | 31-65 |
| 1258 | 25 | 21-30 |
1259 rows × 2 columns
# clean the Gender column
gender = mh_df['Gender'].str.lower() # lowercase all elements
gender = gender.unique() # get list of unique values
# create gender groups for all the different responses
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr", "cis man", "cis male"]
female_str = ["cis female", "f", "female", "woman", "femake", "female ","cis-female/femme", "female (cis)", "femail"]
neither_str = ["trans-female", "something kinda male?", "queer/she/they", "non-binary","nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "trans woman", "neuter", "female (trans)", "queer", "ostensibly male, unsure what that really means", "a little about you", 'p']
# loop through all the gender values and replace with overall gender categories
for (index, row) in mh_df.iterrows():
    if str.lower(row.Gender) in male_str:
        mh_df['Gender'].replace(to_replace=row.Gender, value='male', inplace=True)
    if str.lower(row.Gender) in female_str:
        mh_df['Gender'].replace(to_replace=row.Gender, value='female', inplace=True)
    if str.lower(row.Gender) in neither_str:
        mh_df['Gender'].replace(to_replace=row.Gender, value='neither', inplace=True)
# verify this cleaning worked
mh_df['Gender'].unique()
array(['female', 'male', 'neither'], dtype=object)
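Note that calling `Series.replace` inside `iterrows` rescans the whole column once per row. An equivalent single-pass alternative builds one lookup dict and maps the lowercased values; a sketch with abbreviated response lists (the toy `Series` below is illustrative, not the survey column):

```python
import pandas as pd

# abbreviated versions of the three response buckets
male_str = ["male", "m", "maile", "cis male"]
female_str = ["female", "f", "woman", "femake"]
neither_str = ["non-binary", "genderqueer", "fluid"]

# one lookup dict covering all buckets
lookup = {v: "male" for v in male_str}
lookup.update({v: "female" for v in female_str})
lookup.update({v: "neither" for v in neither_str})

gender = pd.Series(["Male", "F", "non-binary", "Femake"])

# lowercase once, then map every value in a single pass
cleaned = gender.str.lower().map(lookup)
print(cleaned.tolist())
```

Values not present in the lookup come back as NaN, which also makes unrecognized responses easy to spot.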
# replace Yes and No with True and False
train_df = mh_df.copy() # save version of the df for later
mh_df = mh_df.replace({'tech_company': {'Yes': True, 'No': False},
                       'treatment': {'Yes': True, 'No': False},
                       'care_options': {'Yes': True, 'No': False},
                       'family_history': {'Yes': True, 'No': False},
                       'self_employed': {'Yes': True, 'No': False, 'NaN': False},  # code NaN to False
                       'remote_work': {'Yes': True, 'No': False},
                       'mental_vs_physical': {'Yes': True, 'No': False},
                       'anonymity': {'Yes': True, 'No': False},
                       'obs_consequence': {'Yes': True, 'No': False}
                       })
Possible feature variables (used to predict):
- Demographics
- Employment

Possible target variables (outcomes to predict):
# graph gender demographics in this dataset
gender_df = mh_df.copy().groupby('Gender').agg({'Gender':'count'})
gender_df = gender_df.rename(columns={'Gender':'gender_count'}).reset_index()
fig = px.pie(gender_df, values='gender_count', names='Gender', title='Gender Composition of the Survey Data')
fig.show()
Above: Pie chart demonstrating gender disparities in tech. This survey seems representative of disparities in the tech industry as a whole, as nearly 80% of respondents are male.
# graph the geographic distribution of mental health responses
geo_df = mh_df.copy()
geo_df = geo_df.groupby('Country').agg({'Country':'count', 'tech_company':'sum', 'treatment':'sum'})
geo_df = geo_df.rename(columns={'Country':'count'}).reset_index()
fig = px.scatter_geo(geo_df, locations="Country", locationmode="country names",
                     color="tech_company", size="count",
                     projection="natural earth",
                     title="Geovisualization of Mental Health Data")
fig.show()
Above: Shows the distribution of responses to the survey, with the vast majority of respondents in the US or UK (and significantly more tech company respondents in the US).
# create bar charts of target variables
def make_pretty(styler):
    styler.set_caption("Tech Employment, Treatment, and Observed Consequences (column proportions)")
    styler.background_gradient(axis=None, vmin=0, vmax=1, cmap="YlGnBu")
    return styler
count_df = (mh_df.copy()[['tech_company', 'treatment', 'obs_consequence']]
            .apply(pd.Series.value_counts, normalize=True))
count_df.style.pipe(make_pretty)
| | tech_company | treatment | obs_consequence |
|---|---|---|---|
| False | 0.181096 | 0.494043 | 0.853852 |
| True | 0.818904 | 0.505957 | 0.146148 |
Table above: Shows the proportion of each response within each column. Most respondents (82%) work at a tech company, while 18% work outside of tech, and respondents are roughly evenly split on treatment (51% have sought treatment). For observed consequences, about 15% of respondents report having observed negative consequences for coworkers with mental health conditions, while 85% have not. Note that these are overall per-column proportions, not a breakdown by tech status; comparing tech and non-tech workers directly would require a cross-tabulation.
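To actually compare observed consequences between tech and non-tech workers, a row-normalized cross-tab works. A minimal sketch on invented toy data (column names follow the survey, the values do not):

```python
import pandas as pd

# toy frame mimicking the survey columns; values are made up for illustration
df = pd.DataFrame({
    "tech_company":    [True, True, True, False, False],
    "obs_consequence": [False, True, False, True, False],
})

# normalize="index" makes each row sum to 1, i.e. shares within each group
ct = pd.crosstab(df["tech_company"], df["obs_consequence"], normalize="index")
print(ct)
```

Each row then gives the share of workers in that group (tech vs. non-tech) who did or did not observe negative consequences, which is the comparison the prose above needs.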
Here, I use a Random Forest Classifier (an ensemble that aggregates the votes of many decision trees) to predict whether a worker seeks treatment for a mental health condition based on whether or not they work at a tech company.
# use a random forest classifier to predict mental health outcome (treatment) based on feature (tech company)
# divide data into attributes (features or predictor variables) and labels (target variables)
X = mh_df.loc[:, ["tech_company"]].values
y = mh_df.loc[:, "treatment"].values  # 1-D labels, as scikit-learn expects
# divide into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# train and predict the model on the training data
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# print out evaluation metrics
print(f"CONFUSION MATRIX:\n {confusion_matrix(y_test,y_pred)}\n")
print(f"CLASSIFICATION REPORT: \n {classification_report(y_test,y_pred)}")
print(f"ACCURACY SCORE: \n {accuracy_score(y_test, y_pred)}")
CONFUSION MATRIX:
[[ 0 134]
[ 0 118]]
CLASSIFICATION REPORT:
precision recall f1-score support
0 0.00 0.00 0.00 134
1 0.47 1.00 0.64 118
accuracy 0.47 252
macro avg 0.23 0.50 0.32 252
weighted avg 0.22 0.47 0.30 252
ACCURACY SCORE:
0.46825396825396826
C:\Users\caevi\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
The model achieved an accuracy of 46.82%, which is no better than chance. The confusion matrix shows that the model simply predicted "treatment" for every respondent: 118 true positives, 134 false positives, and no true negatives or false negatives. It could not distinguish workers who sought treatment from those who did not based on tech status alone. This is not surprising: a single binary feature carries far too little information to predict the outcome accurately.
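One way to make "no better than chance" concrete is to compare against a majority-class baseline. A minimal sketch using scikit-learn's DummyClassifier on synthetic, signal-free data (not the survey data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# a single binary feature carrying no information about the label
X = rng.integers(0, 2, size=(500, 1))
y = rng.integers(0, 2, size=500)

# baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

base_acc = accuracy_score(y, baseline.predict(X))
forest_acc = accuracy_score(y, forest.predict(X))
print(f"baseline: {base_acc:.3f}, forest: {forest_acc:.3f}")
```

When the feature carries no signal, the forest cannot meaningfully beat the baseline; a model is only worth keeping if it clears this bar.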
Here, we add more features to see if they can improve predictive accuracy. We also use encoding to convert the categorical features into numerical data.
# encoding the data
label_dict = {}
for feature in train_df:
    le = preprocessing.LabelEncoder()
    le.fit(train_df[feature])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    train_df[feature] = le.transform(train_df[feature])
    # get label names and add to dictionary
    label_key = 'label_' + feature
    label_val = [*le_name_mapping]
    label_dict[label_key] = label_val
# get rid of 'Country'
train_df = train_df.drop(['Country'], axis= 1)
train_df.head()
| | Timestamp | Age | Gender | state | self_employed | family_history | treatment | work_interfere | no_employees | remote_work | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | age_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | 0 | 10 | 0 | 0 | 1 | 2 | 4 | 0 | ... | 2 | 1 | 1 | 1 | 2 | 1 | 0 | 2 | 0 | 2 |
| 1 | 1 | 26 | 1 | 11 | 0 | 0 | 0 | 3 | 5 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
| 2 | 2 | 14 | 1 | 29 | 0 | 0 | 0 | 3 | 4 | 0 | ... | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 0 | 2 |
| 3 | 3 | 13 | 1 | 29 | 0 | 1 | 1 | 2 | 2 | 0 | ... | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 2 |
| 4 | 4 | 13 | 1 | 38 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 1 | 1 | 1 | 2 | 2 | 2 | 0 | 0 | 2 |
5 rows × 26 columns
# define the features
feature_cols = ['Age', 'Gender', 'family_history', 'benefits', 'care_options', 'obs_consequence', 'leave', 'work_interfere']
X = train_df[feature_cols]
y = train_df.treatment
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
# create dictionaries for final graph
method_dict = {}
rmse_dict = {}
# build a random forest and compute the relative importance of features
# ExtraTreesClassifier is similar to RandomForestClassifier, but is cheaper to train and sometimes more generalizable.
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
# fit the model
# note: this fits on the full dataset, so the test split created above is seen
# during training and the evaluation below will be optimistic
forest.fit(X, y)
ExtraTreesClassifier(n_estimators=250, random_state=0)
# sort the features by their importance in the random forest model
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
# order the labels to match the sorted importances
labels = [feature_cols[i] for i in indices]
# plot the importances of features
plt.figure(figsize=(12,8))
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), labels, rotation='vertical')
plt.xlim([-1, X.shape[1]])
plt.show()
# evaluate the model on its accuracy
y_pred = forest.predict(X_test)
print(f"CONFUSION MATRIX:\n {confusion_matrix(y_test, y_pred)}\n")
print(f"CLASSIFICATION REPORT: \n {classification_report(y_test, y_pred)}")
print(f"ACCURACY SCORE: \n {accuracy_score(y_test, y_pred)}")
CONFUSION MATRIX:
[[188 0]
[ 5 185]]
CLASSIFICATION REPORT:
precision recall f1-score support
0 0.97 1.00 0.99 188
1 1.00 0.97 0.99 190
accuracy 0.99 378
macro avg 0.99 0.99 0.99 378
weighted avg 0.99 0.99 0.99 378
ACCURACY SCORE:
0.9867724867724867
Results interpretation: This model appears highly accurate, with an accuracy of 98.68% (the percentage of predictions that were correct): 188 true negatives, 185 true positives, 5 false negatives, and no false positives. However, because the forest was fit on the full dataset, the test rows were seen during training, so this figure is optimistic; evaluating on a strictly held-out split or with cross-validation would give a more reliable estimate.
The graph above also ranks the features in the model by their importance. By this ranking, demographic factors like Age and Gender are the strongest predictors of whether a person seeks mental health treatment, followed by a family_history of mental health conditions. Workplace factors, such as mental health benefits, care options, and observed negative consequences for mental health conditions, rank lower, though they may still contribute to the model's accuracy.
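One caveat on the ranking above: impurity-based importances from tree ensembles can favor features with many distinct values, such as Age. A complementary check is permutation importance, which measures how much shuffling each feature hurts the score. A sketch on synthetic data (not the survey):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
signal = rng.integers(0, 2, size=n)  # informative feature
noise = rng.normal(size=n)           # uninformative feature
X = np.column_stack([signal, noise])
y = signal                           # label depends only on the first column

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
# shuffle each feature 10 times and record the drop in accuracy
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

The informative column shows a large mean importance while the noise column's stays near zero, so agreement between the two importance measures here would strengthen the Age/Gender interpretation.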